This analysis explores the red wine quality data provided by Paulo Cortez (Univ. Minho). The dataset records eleven physicochemical characteristics of each wine that contribute to its quality score.
You can find more information about these characteristics here.
Input variables (based on physicochemical tests):
1 - fixed acidity (tartaric acid - g / dm^3)
2 - volatile acidity (acetic acid - g / dm^3)
3 - citric acid (g / dm^3)
4 - residual sugar (g / dm^3)
5 - chlorides (sodium chloride - g / dm^3)
6 - free sulfur dioxide (mg / dm^3)
7 - total sulfur dioxide (mg / dm^3)
8 - density (g / cm^3)
9 - pH
10 - sulphates (potassium sulphate - g / dm^3)
11 - alcohol (% by volume)
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
| Really Bad | Bad | Average | Good | Really Good |
|---|---|---|---|---|
| 1-2 | 3-4 | 5-6 | 7-8 | 9-10 |
1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
2 - volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste
3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines
4 - residual sugar: the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet
5 - chlorides: the amount of salt in the wine
6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
8 - density: the density of wine is close to that of water depending on the percent alcohol and sugar content
9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
11 - alcohol: the percent alcohol content of the wine
## [1] 1599 12
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : Ord.factor w/ 6 levels "3"<"4"<"5"<"6"<..: 3 3 3 4 3 3 3 5 5 3 ...
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
## # A tibble: 12 × 3
## bin n freq
## <int> <int> <dbl>
## 1 5 9 0.0056
## 2 6 62 0.0388
## 3 7 291 0.1820
## 4 8 496 0.3102
## 5 9 300 0.1876
## 6 10 188 0.1176
## 7 11 118 0.0738
## 8 12 76 0.0475
## 9 13 39 0.0244
## 10 14 12 0.0075
## 11 15 3 0.0019
## 12 16 5 0.0031
The distribution of fixed acidity is slightly positively skewed. The distance between the third quartile (9.20 g/dm^3) and the maximum (15.90 g/dm^3) indicates some variability in the right tail. The mean, 8.32 g/dm^3, is slightly larger than the median, 7.90 g/dm^3. About 31% of the wines have a fixed acidity around 8 g/dm^3.
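The `freq` column in the bin table can be reproduced from the counts alone. The following is a minimal sketch, not the original analysis code; the names `bin`, `n`, `freq` mirror the table header and `peak_share` is an illustrative helper. Each frequency is the bin count divided by the total number of wines.

```r
# Bin labels and counts copied from the fixed acidity table above.
bin <- 5:16
n   <- c(9, 62, 291, 496, 300, 188, 118, 76, 39, 12, 3, 5)

# Relative frequency of each bin, rounded to four decimals as in the table.
freq <- round(n / sum(n), 4)

fixed_acidity_bins <- data.frame(bin = bin, n = n, freq = freq)
print(fixed_acidity_bins)

# Share of wines in the most populated bin (the one labelled 8).
peak_share <- freq[which.max(n)]
```

The counts sum to the 1599 observations reported by `dim()`, which is a quick check that no rows were lost while binning.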
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
## # A tibble: 8 × 3
## bin n freq
## <fctr> <int> <dbl>
## 1 <.2 425 0.2658
## 2 0.4 650 0.4065
## 3 0.6 400 0.2502
## 4 0.8 83 0.0519
## 5 1 17 0.0106
## 6 1.2 3 0.0019
## 7 1.4 1 0.0006
## 8 NA 20 0.0125
Volatile acidity is positively skewed with a maximum of 1.58 g/dm^3. The median and the mean of this distribution are almost the same, at 0.52 g/dm^3 and 0.53 g/dm^3, and both are below 0.6 g/dm^3. Roughly 40% of the wines have a volatile acidity around 0.4 g/dm^3.
After a log transformation, you can see a spike around 0.6 g/dm^3. About 90% of the wines have a concentration of acetic acid equal to or below 0.6 g/dm^3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
## # A tibble: 10 × 3
## bin n freq
## <fctr> <int> <dbl>
## 1 0 306 0.1914
## 2 0.1 193 0.1207
## 3 0.2 291 0.1820
## 4 0.3 234 0.1463
## 5 0.4 253 0.1582
## 6 0.5 112 0.0700
## 7 0.6 62 0.0388
## 8 0.7 15 0.0094
## 9 0.9 1 0.0006
## 10 NA 132 0.0826
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Removed 132 rows containing non-finite values (stat_bin).
Citric acid appears to be bimodal, with one major peak at 0 and another just below 0.25 g/dm^3. The mean (0.271 g/dm^3) and the median (0.260 g/dm^3) are extremely close, and the majority of the wines have a citric acid level below 0.6 g/dm^3. The log transformation dropped some values: the warnings above report 132 rows removed, and the bin table shows 132 NA values. Let’s check out these values.
## Mode FALSE TRUE NA's
## logical 132 1467 0
Now we can confirm that there are 132 observations with 0 citric acid.
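To illustrate why zero values vanish under a log scale, here is a small sketch using only the first ten `citric.acid` values printed by `str()` earlier. The variable names are illustrative, and the original check was presumably something like `summary(citric.acid > 0)`, which is an assumption since that code is not shown.

```r
# Toy vector: the first ten citric.acid values from the str() output above.
citric_acid <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

# log10(0) is -Inf, which a log-scaled axis cannot place,
# so those observations are dropped as non-finite.
log_citric <- log10(citric_acid)
n_dropped  <- sum(!is.finite(log_citric))

# Counting the zeros directly gives the same number.
n_zero <- sum(citric_acid == 0)

# summary() on a logical vector reports the FALSE/TRUE split.
print(summary(citric_acid > 0))
```

On the full data the same comparison would separate the wines with zero citric acid from the rest, which is how the 132 dropped rows line up with the 132 `FALSE` entries.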
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
## # A tibble: 13 × 3
## bin n freq
## <fctr> <int> <dbl>
## 1 <1 2 0.0013
## 2 2 618 0.3865
## 3 3 764 0.4778
## 4 4 90 0.0563
## 5 5 41 0.0256
## 6 6 36 0.0225
## 7 7 19 0.0119
## 8 8 8 0.0050
## 9 9 10 0.0063
## 10 11 3 0.0019
## 11 13 1 0.0006
## 12 14 4 0.0025
## 13 NA 3 0.0019
The distribution of residual sugar is very concentrated around the mean and the median. It also has a very long tail: much of the variability lies between the third quartile at 2.60 g/dm^3 and the maximum of 15.5 g/dm^3.
The transformed distribution shows residual sugar concentrated around 2.5 g/dm^3. Most of the wines have a very low concentration of residual sugar. About 70% of the data lie between 2 g/dm^3 and 3 g/dm^3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
## # A tibble: 6 × 3
## bin n freq
## <fctr> <int> <dbl>
## 1 <.1 1376 0.8605
## 2 0.2 182 0.1138
## 3 0.3 19 0.0119
## 4 0.4 9 0.0056
## 5 0.5 11 0.0069
## 6 NA 2 0.0013
Chlorides appear to be roughly normal around the median, with high variability between the third quartile (0.09 g/dm^3) and the maximum of about 0.6 g/dm^3. There are also outliers at the low end, with values near 0.01 g/dm^3. About 86% of the wines have 0.1 g/dm^3 of chlorides or less.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
## # A tibble: 8 × 3
## bin n freq
## <fctr> <int> <dbl>
## 1 10 605 0.3784
## 2 20 555 0.3471
## 3 30 276 0.1726
## 4 40 122 0.0763
## 5 50 25 0.0156
## 6 60 12 0.0075
## 7 70 3 0.0019
## 8 NA 1 0.0006
Free sulfur dioxide is positively skewed with some trailing outliers. Most of the data is between 10 mg/dm^3 and 30 mg/dm^3. The mean is 15.87 mg/dm^3 and the maximum is 72 mg/dm^3.
The transformed free sulfur dioxide distribution shows a little more detail in the lowest bin, which holds approximately 38% of the wines.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
## # A tibble: 11 × 3
## bin n freq
## <fctr> <int> <dbl>
## 1 0 239 0.1495
## 2 16 454 0.2839
## 3 32 328 0.2051
## 4 48 203 0.1270
## 5 64 127 0.0794
## 6 80 105 0.0657
## 7 96 63 0.0394
## 8 112 35 0.0219
## 9 128 27 0.0169
## 10 144 15 0.0094
## 11 NA 3 0.0019
Total sulfur dioxide is in a similar situation to free sulfur dioxide: it is positively skewed, with a high outlier near 290 mg/dm^3. The median is 38 mg/dm^3 and the mean is 46.47 mg/dm^3.
Looking closer at total sulfur dioxide on a log scale, it appears to be approximately normally distributed. About 15% of the data is in each tail and about 70% is in the middle.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0040
## # A tibble: 5 × 3
## bin n freq
## <fctr> <int> <dbl>
## 1 0.99 405 0.2533
## 2 0.995 396 0.2477
## 3 1 396 0.2477
## 4 1.005 400 0.2502
## 5 NA 2 0.0013
Density has a mean and a median that are approximately the same, at about 0.997 g/cm^3, and the data is almost symmetrical.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
## # A tibble: 4 × 3
## bin n freq
## <fctr> <int> <dbl>
## 1 0 178 0.1113
## 2 0.5 1363 0.8524
## 3 1 50 0.0313
## 4 1.5 8 0.0050
Sulphates is another feature with most of the data points around 0.6 g/dm^3. The spike accounts for ~85% of the wines, close to the mean. With a log transformation, it appears more evenly distributed with a few outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
## # A tibble: 4 × 3
## bin n freq
## <fctr> <int> <dbl>
## 1 8 747 0.4672
## 2 10 711 0.4447
## 3 12 140 0.0876
## 4 14 1 0.0006
The alcohol distribution is also positively skewed, with a few high values trailing up to 14.9% by volume. The mean alcohol content is 10.42% by volume, and ~90% of the wines have an alcohol content between 8% and 12% by volume.
## Warning: Ignoring unknown parameters: binwidth, bins, pad
## # A tibble: 6 × 3
## quality n freq
## <ord> <int> <dbl>
## 1 3 10 0.0063
## 2 4 53 0.0331
## 3 5 681 0.4259
## 4 6 638 0.3990
## 5 7 199 0.1245
## 6 8 18 0.0113
The quality distribution appears positively skewed, peaking between 5 and 6, with a mean of about 5.64 and a median of 6. No wines were rated really bad or really good. A few wines scored a 3 or an 8, but most are average: approximately 82% of the wines have a rating of 5 or 6.
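The summary statistics for quality can be recomputed directly from the frequency table. This is a sketch with illustrative variable names (`score`, `mean_quality`, and so on are not from the original analysis); it expands the counts back into individual ratings to obtain the median.

```r
# Scores and counts copied from the quality table above.
score <- 3:8
n     <- c(10, 53, 681, 638, 199, 18)

# Weighted mean of the scores.
mean_quality <- sum(score * n) / sum(n)

# Median of the individual ratings, rebuilt by repeating each score n times.
median_quality <- median(rep(score, times = n))

# Share of wines rated 5 or 6.
share_5_or_6 <- sum(n[score %in% c(5, 6)]) / sum(n)

print(c(mean = mean_quality, median = median_quality, share = share_5_or_6))
```

Because quality only takes whole-number values, the median of the raw ratings is itself a whole number, whereas the mean can fall between two scores.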
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.270 7.827 8.720 9.118 10.070 17.050
## # A tibble: 7 × 3
## bin n freq
## <fctr> <int> <dbl>
## 1 4 20 0.0125
## 2 6 469 0.2933
## 3 8 698 0.4365
## 4 10 282 0.1764
## 5 12 107 0.0669
## 6 14 18 0.0113
## 7 16 5 0.0031
Here I’ve created a variable called acidity. Although the distribution of the acidity level is closer to a normal distribution, it is still positively skewed. Roughly 44% of the wines have an acidity level around 8 g/dm^3, near the median of 8.72 g/dm^3.
| Observations | Variables |
|---|---|
| 1599 | 13 |
Note:
quality is a discrete categorical variable based on sensory data. X is the observation id, and all other variables are numerical values based on physicochemical tests.
The main feature of this dataset is quality. With this analysis, I seek to uncover the features that contribute to good tasting wine.
One of the characteristics of wine is the “body”, which describes the texture of wine in the mouth. A feature that influences texture is the alcohol content: a higher alcohol content is described as giving wine a richer feel. Another characteristic that stood out is acidity, which gives wines a tart and crisp taste. I suspect citric acid, volatile acidity and fixed acidity are important because they can affect the flavors of wine.
I created the acidity variable by combining fixed.acidity, volatile.acidity, and citric.acid, because I wanted to see whether their combined distribution is normal or skewed. Since fixed acidity and volatile acidity are both positively skewed, the distribution of all three acids combined is positively skewed as well.
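As a sketch of how the combined variable might be built, the example below assumes that "combined" means a plain row-wise sum of the three acid columns. That reading is an assumption, although it agrees with the summaries above, where the three means (8.32, 0.5278 and 0.271) add up to roughly the reported acidity mean of 9.118. The data frame `wine_sample` is a toy stand-in holding the first five rows printed by `str()`.

```r
# Toy stand-in: first five rows of the three acid columns from str() above.
wine_sample <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70),
  citric.acid      = c(0.00, 0.00, 0.04, 0.56, 0.00)
)

# Assumption: the combined acidity is the sum of the three concentrations (g/dm^3).
wine_sample$acidity.level <- with(
  wine_sample,
  fixed.acidity + volatile.acidity + citric.acid
)

print(wine_sample)

# Sanity check on the reported means: they should add up to the acidity mean.
sum_of_means <- 8.32 + 0.5278 + 0.271
```

Since fixed acidity is an order of magnitude larger than the other two acids, a sum like this is dominated by fixed acidity, which would explain why the combined distribution looks so similar to it.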
Working with a tidy dataset, I did not have to perform additional transformations. While investigating, I encountered a few interesting observations.
In this section, let’s look at a few variables against quality to get a feel for their correlation. Since alcohol affects the mouthfeel of wine, let’s look at that first.
##
## Pearson's product-moment correlation
##
## data: as.numeric(quality) and alcohol
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4373540 0.5132081
## sample estimates:
## cor
## 0.4761663
There is a positive trend between alcohol and quality. In good quality wine, the median can be as high as 12% by volume. Interestingly, wines that scored a 5 have the lowest median alcohol content, below 10%. This quality level also has some extreme outliers, up to 15%. So far, alcohol and quality have a correlation of 0.48.
Using a density plot, we can see that wines with a lower quality are more likely to have an alcohol concentration around 9.5% by volume. For good tasting wines, the alcohol content is between 11% and 13%.
The filled density plot indicates a clear positive trend for good quality wine. Average wines show a similar trend but include more wines with a lower alcohol concentration.
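The numbers in the Pearson test output for alcohol and quality are tied together by two standard formulas: the t statistic is r·sqrt(df)/sqrt(1 − r²), and the confidence interval comes from the Fisher z-transformation. The sketch below recomputes both from the reported correlation and sample size alone; the variable names are illustrative.

```r
r  <- 0.4761663   # sample correlation reported by the test above
n  <- 1599        # number of wines
df <- n - 2       # degrees of freedom shown in the output

# t statistic for the null hypothesis of zero correlation.
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# 95% confidence interval via the Fisher z-transformation.
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

print(round(c(t = t_stat, lower = ci[1], upper = ci[2]), 4))
```

With nearly 1600 observations, even the weak correlations later in this section produce very small p-values, so the size of the coefficient is more informative here than its significance.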
##
## Pearson's product-moment correlation
##
## data: as.numeric(quality) and fixed.acidity
## t = 4.996, df = 1597, p-value = 6.496e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.07548957 0.17202667
## sample estimates:
## cor
## 0.1240516
The amount of fixed acidity increases steadily between average wines and good wines. Here we also see that good wines and bad wines have a wide IQR. This feature’s correlation coefficient is low at 0.12.
It’s interesting to see that most wines in this data set have a fixed acidity level around 7.5. This could be the case if all the grapes were harvested and processed at the same time.
##
## Pearson's product-moment correlation
##
## data: as.numeric(quality) and volatile.acidity
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4313210 -0.3482032
## sample estimates:
## cor
## -0.3905578
After a dramatic decrease in concentration, volatile acidity stabilizes around 0.4 g/dm^3 for wines with a rating between 7 and 8. It comes right after alcohol in strength, with a correlation of -0.39. Volatile acidity is the concentration of acetic acid in the wine, which gives the wine a vinegar taste at high concentrations and is mostly caused by bacteria. Notice that in bad tasting wines, this concentration is much higher.
The density plot also shows that bad wines have a higher volatile acidity concentration, and the range of this concentration is wider. The average and good wines tend to have a lower concentration of volatile acidity, around 0.3 to 0.4 g/dm^3.
##
## Pearson's product-moment correlation
##
## data: as.numeric(quality) and citric.acid
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1793415 0.2723711
## sample estimates:
## cor
## 0.2263725
As expected, citric acid is fairly significant to the taste of wine. Although the range for each rating is somewhat similar, bad wine has as little as 0.05 g/dm^3 and good wine has up to 0.4 g/dm^3. The long whiskers in these plots also indicate variability in the amount of citric acid. I’m surprised to see that the correlation coefficient is only 0.23, which is much lower than that of alcohol.
Citric acid can range from 0 to 0.8 g/dm^3. Citric acid in better tasting wines tends to be around 0.5 g/dm^3. In most cases of bad wine, there is no citric acid present, which could explain a lack of flavor. Similarly, the average wines also have less citric acid in them than the good wines.
##
## Pearson's product-moment correlation
##
## data: as.numeric(quality) and acidity.level
## t = 4.1688, df = 1597, p-value = 3.227e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.05501239 0.15200177
## sample estimates:
## cor
## 0.1037537
It is interesting to see quite a bit of variance within each rating. Most of these wines have a median acidity level between 8.5 and 9.5 g/dm^3. The range of acidity level for average wine is broad, with outliers as high as 17 g/dm^3. I was hoping to see a stronger correlation with the different acids combined, but it is only 0.10.
This is similar to fixed acidity, where most of the wines have a concentration around 8 g/dm^3.
Now that we’ve looked at different features against quality, let’s take a look at a matrix for other relationships.
The matrix reveals a few stronger relationships between other features that were not apparent when we compared against quality alone. Notice that three of the four prominent relationships involve fixed acidity, paired with citric acid, density and pH.
When plotting the different features against quality, I was surprised to see that most of the relationships have a low correlation coefficient. Although alcohol and quality have a correlation of 0.48, I’m not convinced that it’s the dominating factor that impacts the taste of wine.
When observing the overall acidity level against quality, the correlation coefficient is also very low. I was expecting a much stronger correlation. Perhaps the overall acidity level indirectly influences other features like pH or density, which in turn can change the alcohol content.
After examining the matrix above, there are some strong relationships between fixed acidity and other variables like citric acid, volatile acidity and sulfur dioxide. This could be an indication of some lurking variable that indirectly influences the wine quality.
One of the strongest relationships observed in the matrix is between pH and fixed acidity, at -0.68. It is followed by fixed acidity and citric acid, fixed acidity and density, and free sulfur dioxide and total sulfur dioxide, all with a positive correlation coefficient of 0.67. The matrix also confirms that the strongest relationship to quality is alcohol content, with a correlation coefficient of 0.48.
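A correlation matrix like the one discussed here can be computed with base R's `cor()` and then ranked by absolute strength. The sketch below runs on a toy sample only (the first ten rows of three columns printed by `str()` earlier), so its coefficients will not match the values quoted above; `wine_sample`, `idx` and `ranked` are illustrative names, and the original analysis may have used a plotting package instead.

```r
# Toy stand-in: first ten rows of three columns from the str() output.
wine_sample <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  citric.acid   = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH            = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)
)

# Pairwise Pearson correlations between all columns.
cor_matrix <- cor(wine_sample, method = "pearson")

# Keep each pair once (upper triangle) and sort by absolute strength.
idx <- which(upper.tri(cor_matrix), arr.ind = TRUE)
ranked <- data.frame(
  var1 = rownames(cor_matrix)[idx[, "row"]],
  var2 = colnames(cor_matrix)[idx[, "col"]],
  r    = cor_matrix[idx]
)
ranked <- ranked[order(-abs(ranked$r)), ]

print(round(cor_matrix, 2))
print(ranked)
```

Sorting by the absolute value keeps strong negative relationships, such as the one between pH and fixed acidity, next to the strong positive ones.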
In the previous section, we’ve seen a few relationships with various acids. In this section, let’s explore those features and how they influence the quality of red wine.
With a correlation coefficient of -0.68, a high concentration of fixed acidity seems to drive the pH level down, making the wine more acidic. Here I’ve chosen a diverging color palette that puts more emphasis on the lower and higher quality wines. While most of the average wines have a pH between 3 and 3.5 with a fixed acid concentration ranging from 6 g/dm^3 to 10 g/dm^3, the good wines follow the same general relationship.
The relationship between density and fixed acidity is another strong feature. This is a positive relationship with a 0.67 correlation coefficient. The amount of fixed acids influences how dense the wine is: the higher the concentration, the denser the wine. Looking at the good quality wines, we can see that most are concentrated towards the lower side, while the average wines are mostly at the top. Density can contribute to the body of wine and could make wine feel more viscous in the mouth.
The concentration of fixed acidity in good wine is also lower, around 8 g/dm^3 to 10 g/dm^3. Fixed acidity in this dataset measures the level of tartaric acid in the wine. Tartaric acid influences the taste and feel of the wine. It also kills undesirable bacteria by lowering the pH. A high concentration of it can cause wine to feel dense in the mouth.
Fixed acidity and citric acid hold the strongest positive relationship. As the concentration of citric acid increases, the concentration of fixed acid also increases. Here, the average wines are mostly in the lower region of citric acid and fixed acid. The good quality wines mostly have a citric acid level between 0.25 g/dm^3 and 0.75 g/dm^3.
Similarly, the pattern for the overall acidity level resembles that of fixed acidity. Most of the good wines have a lower density and the average wines are higher in density.
The negative relationship between citric acid and volatile acidity has a correlation of -0.55. You can see that most bad wines have little citric acid. These wines also have a higher concentration of volatile acidity. The good wines have a decent amount of citric acid, and their level of volatile acids is around 0.4 g/dm^3 and below.
With a correlation of 0.67, free sulfur dioxide and total sulfur dioxide share a positive relationship. A higher concentration of free sulfur dioxide also means a higher concentration of total sulfur dioxide. Here, we see that the good wines range between 5 mg/dm^3 and 30 mg/dm^3 in free sulfur dioxide and also have a lower total sulfur dioxide concentration. The average wines tend to have lower free sulfur dioxide but higher total sulfur dioxide.
Now that we’ve looked at different features against quality, let’s take a look at a multivariate matrix for these features.
This matrix reveals something very interesting. Citric acid and fixed acidity share the strongest relationship of all the features. For wines with a quality rating of 8, the correlation coefficient is 0.89; for wines with a quality rating of 3, it is 0.96. The relationship between density and fixed acidity is similar, in that the correlation coefficient is strong for both the good wines and the bad wines. This makes it difficult to determine an effective threshold for these features on wine quality.
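Per-rating correlations of this kind can be obtained by splitting the data by quality and computing the coefficient within each group. The sketch below uses synthetic data generated for illustration only, not the wine data, so the printed coefficients mean nothing beyond showing the mechanics; `toy`, `cor_by_quality` and the simulated relationship are all assumptions.

```r
set.seed(42)

# Synthetic stand-in data: NOT the wine dataset.
n_per_group <- 50
toy <- data.frame(
  quality     = rep(c(3, 5, 8), each = n_per_group),
  citric.acid = runif(3 * n_per_group, min = 0, max = 0.8)
)
# Simulated positive relationship plus noise, chosen arbitrarily.
toy$fixed.acidity <- 6 + 5 * toy$citric.acid + rnorm(nrow(toy), sd = 0.8)

# Split by rating and compute the Pearson correlation within each group.
cor_by_quality <- sapply(
  split(toy, toy$quality),
  function(group) cor(group$citric.acid, group$fixed.acidity)
)

# Group sizes matter when comparing the coefficients.
group_sizes <- table(toy$quality)

print(round(cor_by_quality, 2))
print(group_sizes)
```

In the real data the extreme ratings are rare (the quality table lists 10 wines rated 3 and 18 rated 8), so their within-group correlations rest on very few observations and should be compared with caution.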
m1 <- lm(as.numeric(quality) ~ alcohol, data = wine_acidity)
m2 <- update(m1, ~. + acidity.level)
summary(m2)
##
## Call:
## lm(formula = as.numeric(quality) ~ alcohol + acidity.level, data = wine_acidity)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8615 -0.3969 -0.1303 0.5265 2.5950
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.743555 0.199062 -3.735 0.000194 ***
## alcohol 0.367718 0.016517 22.263 < 2e-16 ***
## acidity.level 0.059973 0.009604 6.244 5.44e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7021 on 1596 degrees of freedom
## Multiple R-squared: 0.2452, Adjusted R-squared: 0.2442
## F-statistic: 259.2 on 2 and 1596 DF, p-value: < 2.2e-16
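One detail worth checking in the model call is `as.numeric(quality)`. Since `str()` reports quality as an ordered factor with levels "3" to "8", `as.numeric()` returns the level positions 1 to 6 rather than the scores 3 to 8. The sketch below demonstrates this on the first ten rows printed earlier; the variable names are illustrative and the tiny sample is only meant to show the effect on the coefficients.

```r
# First ten quality scores and alcohol values from the str() output.
quality <- factor(c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5), levels = 3:8, ordered = TRUE)
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)

# as.numeric() on a factor yields level positions (1..6), not the labels.
level_codes <- as.numeric(quality)

# Converting through character recovers the actual scores (3..8).
scores <- as.numeric(as.character(quality))

fit_codes  <- lm(level_codes ~ alcohol)
fit_scores <- lm(scores ~ alcohol)

print(rbind(codes = coef(fit_codes), scores = coef(fit_scores)))
```

Because the two codings differ by a constant of 2, slopes, correlations and R-squared are unaffected; only the intercept shifts, which is worth remembering when interpreting the intercept in the summary above or predicting on the original 3 to 8 scale.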
Looking at multiple features against quality, we can see that several features affect how wine tastes. Since fixed acidity is the amount of tartaric acid present in wine and citric acid is a flavor enhancer for “flat” tasting wines, these two features greatly impact how wine tastes. Wines with a higher amount of citric acid also contain a higher amount of fixed acid. A higher amount of fixed acid leads to a denser wine with a lower pH level. This is consistent with what we observed in the previous section.
Free sulfur dioxide and total sulfur dioxide also have a strong interaction with each other. These features measure the amount of SO2, an antimicrobial agent used to regulate the growth of harmful yeast. Too much sulfite could also stop the fermentation process. Although this is a positive relationship, the plot shows a high dispersion of bad and average wines. A high concentration could cause wine to smell very unpleasant.
Here I also created two models: one with alcohol alone and the other adding acidity level. Although the results indicate that both alcohol and acidity level have significant p-values, the R-squared of 0.2452 indicates that these models are not a good fit for predicting wine quality from alcohol and acidity. This could be because most of the wines tasted are average and there is much variability within each category of wine quality.
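The fit statistics in the model summary follow from the multiple R-squared, the sample size and the number of predictors. As a worked check using the standard formulas (variable names are illustrative), the adjusted R-squared and the F-statistic can be recovered like this:

```r
r_squared <- 0.2452   # multiple R-squared from the summary
n <- 1599             # observations
p <- 2                # predictors: alcohol and acidity.level

# Adjusted R-squared penalises the number of predictors.
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

# F statistic for the overall regression on p and n - p - 1 degrees of freedom.
f_stat <- (r_squared / p) / ((1 - r_squared) / (n - p - 1))

print(round(c(adj_r_squared = adj_r_squared, f_stat = f_stat), 4))
```

An R-squared near 0.25 means the two predictors account for about a quarter of the variance in the quality ratings, so a highly significant F-statistic and a weak predictive fit can coexist in a sample this large.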
The alcohol concentration peaks at 14.9% by volume for good wines, while the average wines tend to stay between 9% and 11%. This red wine sample has a wide range of variability, but the correlation is positive: wines with a higher alcohol content tend to get a better rating. Wines with a higher alcohol concentration also have more body and are less watery. Since alcohol lets you feel heat from the wine, it may enhance the aroma and texture of the wine and allow the tasters to experience more complex flavors.
The presence of acetic acid is due to yeast and bacterial metabolism. It contributes to the smell and taste of wine. It is apparent that the better tasting wines have a lower amount of volatile acids in them. A high amount of volatile acidity gives wine an unpleasant vinegar taste. In this case, the average wines tend to be around 0.4 g/dm^3 to 0.8 g/dm^3.
Volatile acidity, or acetic acid, also has a strong relationship with citric acid. In this case, we see that the average and bad wines tend to have more acetic acid. While citric acid gives wine “freshness”, the downside is that it promotes the growth of unwanted microbes. With the high amount of acetic acid in these wines, it is possible that they have more of a vinegar taste resulting from the two acids, thus getting a bad rating.
We can see that citric acid is prominent in all the quality levels. Most of the bad tasting wines in this data set don’t have enough citric acid. Recall that the level of citric acid for better tasting wines has a median of ~0.4 g/dm^3 on a range from 0 to 1 g/dm^3. The level of citric acid for bad wines is lower than this median.
This dataset has given me some insights on how different elements can impact the taste of wine although this is purely subjective based on the taster.
I started the exploration by plotting every feature with a histogram to get a better understanding of their distributions. I noticed that many of these features are positively skewed. I was interested in alcohol because it shows a strong influence on wine quality. I also suspected that the various acids influence how wine tastes. After plotting these features against quality and calculating the correlation coefficients, I found that none of these variables directly drives quality.
I used density plots to get a better look at the different distributions by quality, which gave me a better idea of how the various acids affect quality. These plots helped me see relationships between one variable and another that the histograms did not capture. Once I started investigating the different acids, I learned that they can make wine more complex. I was excited to see how these acids relate to each other, but I was also overwhelmed because it was difficult to pinpoint what causes what. Depending on the amount of each acid present, they can enhance the wine’s characteristics or turn the wine into something unpleasant to drink and smell.
I created a new variable to look at the overall acidity in the wine. I was hoping that together the acids would have a stronger correlation with quality, but the correlation did not budge much.
In addition, I calculated several matrices and got distracted by stronger correlations that did not surface in the matrix shown here. Due to the complexity and number of features in this dataset, I’ve omitted them to keep the analysis more focused.
While analyzing this dataset, I keep wanting to compare it to other red wine data. This dataset would benefit from having additional wine data from other regions like Sonoma or Napa Valley. Also having more data on really bad or really good wine would be extra nice. Next time when I buy wine, I might look for a bottle from Portugal and a local winery to see how different they are.
Data:
Data Citation
GGally
R:
Density plots
dplyr summarise
summarise_each
Interpreting linear models in R
Linear Models
corrplot
Wine knowledge:
Waterhouse Lab
Sulfur in wine
Wine Chemistry
Wine flavors
Volatile acidity
Wine making
Chlorides in wine
Citric acid
Sulfur dioxide
Wine characteristics